Workshop Day 2A | 2022-07-26 Jeffrey M. Girard | Pitt Methods
Wrangle II
Basic wrangling verbs
tidyverse provides tools for wrangling tibbles
These functions are named after verbs
So if you name your objects after nouns…
…your code becomes easier to read
Noun(noun) ❌ | Verb(noun) ✔️
blender(fruit) | blend(fruit)
screwdriver(screw) | drive(screw)
boxcutter(box) | cut(box)
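A minimal sketch of the naming idea in R. The tibble `fruit` and the helper `blend()` are hypothetical and use made-up values; they are not part of the workshop materials.

```r
library(tibble) # provides tibble()

# Noun: the object holds data (hypothetical example values)
fruit <- tibble(
  name = c("apple", "banana", "cherry"),
  grams = c(180, 120, 8)
)

# Verb: the function does something to the data
# blend() is a hypothetical helper that combines all rows into one smoothie
blend <- function(x) {
  tibble(name = "smoothie", grams = sum(x$grams))
}

smoothie <- blend(fruit) # reads as "blend the fruit"
smoothie
```

Because the object is a noun and the function is a verb, the call reads like a short sentence.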
Basic wrangling verbs
Primary Functions (most used)
select() retains only certain columns
mutate() adds or transforms columns
filter() retains rows based on criteria
Secondary Functions (less used)
arrange() sorts rows by their values
rename() changes column names
relocate() moves columns around
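mutate() is not live-coded on its own in this part, so here is a minimal sketch of its two uses (adding and transforming columns). The `pets` tibble is hypothetical example data, not from the workshop materials.

```r
library(dplyr) # provides tibble(), mutate(), and the other verbs

# Hypothetical example data
pets <- tibble(
  name = c("Rex", "Tom", "Bubbles"),
  weight_kg = c(30, 4.5, 0.1)
)

# Add a new column computed from an existing one
pets2 <- mutate(pets, weight_g = weight_kg * 1000)
pets2

# Transform an existing column by reusing its name
pets3 <- mutate(pets, name = toupper(name))
pets3
```

As with the other verbs, the result must be saved with assignment; `pets` itself is left unchanged.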
Select Live Coding
# SETUP: Load package and inspect example tibble
library(tidyverse) # includes the dplyr package
starwars
# ==============================================================================
# USECASE: Retain only the specified variables
sw <- select(starwars, name)
sw
sw <- select(starwars, name, sex, species)
sw
# ==============================================================================
# PITFALL: Don't forget to save the change with assignment
select(starwars, name, sex, species)
starwars # still includes all variables
# ==============================================================================
# USECASE: Change the order of variables
sw <- select(starwars, species, name, sex)
sw
# ==============================================================================
# USECASE: Retain all variables between two variables
sw <- select(starwars, name, hair_color:eye_color)
sw
# ==============================================================================
# USECASE: Retain all variables except the specified ones
sw <- select(starwars, -sex, -species)
sw
sw <- select(starwars, -c(sex, species))
sw
sw <- select(starwars, -c(hair_color:starships))
sw
Rename Live Coding
# USECASE: Change the name of one or more variables
# TEMPLATE: df2 <- rename(df, new_name = old_name)
starwars
sw <- rename(starwars, Character = name)
sw
sw <- rename(starwars, height_cm = height, mass_kg = mass)
sw
# ==============================================================================
# PITFALL: Don't swap the order and try old_name = new_name
sw <- rename(starwars, name = Character) # error
Relocate Live Coding
# USECASE: Move variables before another variable or position
starwars
sw <- relocate(starwars, sex, .before = height)
sw
sw <- relocate(starwars, species, sex, .before = name)
sw
sw <- relocate(starwars, homeworld, .before = 1)
sw
# ==============================================================================
# PITFALL: Don't forget the period!
sw <- relocate(starwars, sex, before = height)
sw # height was accidentally renamed to before
# ==============================================================================
# USECASE: Move variables after another variable or position
sw <- relocate(starwars, sex, .after = height)
sw
sw <- relocate(starwars, species, sex, .after = name)
sw
sw <- relocate(starwars, homeworld, .after = 1)
sw
Arrange Live Coding
# USECASE: Sort observations by a variable
starwars
sw <- arrange(starwars, height)
sw # sorted by height, ascending
sw <- arrange(starwars, name)
sw # sorted by name, alphabetically
# ==============================================================================
# USECASE: Sort observations by a variable, in reverse order
sw <- arrange(starwars, desc(height))
sw # sorted by height, descending
sw <- arrange(starwars, desc(name))
sw # sorted by name, reverse-alphabetically
# ==============================================================================
# USECASE: Sort observations by multiple variables
sw <- arrange(starwars, hair_color, mass)
sw # sorted by hair_color, then ties broken by mass
Filter Live Coding
# USECASE: Retain only observations that meet a criterion
sw <- filter(starwars, mass > 100)
sw # only observations with mass greater than 100
sw <- filter(starwars, mass <= 100)
sw # only observations with mass less than or equal to 100
sw <- filter(starwars, species == "Human")
sw # only observations with species equal to Human
sw <- filter(starwars, species != "Human")
sw # only observations with species not equal to Human
# ==============================================================================
# PITFALL: Don't try to use a single = for testing equality
sw <- filter(starwars, height = 150) # error
sw <- filter(starwars, height == 150) # correct
sw
# ==============================================================================
# PITFALL: Don't forget that R is case-sensitive
sw <- filter(starwars, species == "human")
sw # no observations left (because it should be Human)
# ==============================================================================
# USECASE: Retain only observations that meet complex criteria
sw <- filter(starwars, mass > 100 & height > 200)
sw # only observations with mass over 100 AND height over 200
sw <- filter(starwars, height < 100 | hair_color == "none")
sw # only observations with height under 100 OR hair_color equal to none
# ==============================================================================
# PITFALL: Don't forget to complete both conditions
sw <- filter(starwars, mass > 100 & < 200) # error
sw <- filter(starwars, mass > 100 & mass < 200) # correct
sw
# ==============================================================================
# PITFALL: Don't try to equate a string to a vector
sw <- filter(starwars, species == c("Human", "Droid")) # wrong: == recycles the vector element by element, so rows are missed
sw <- filter(starwars, species %in% c("Human", "Droid")) # correct
sw
Pipes Live Coding
# SETUP: Enable the pipe operator shortcut
# Tools > Global Options... > Code tab > Check "Use Native Pipe Operator"
# Type out |> or press Ctrl+Shift+M (Windows) / Cmd+Shift+M (Mac)
# ==============================================================================
# LESSON: The pipe pushes objects to a function as its first argument
# TEMPLATE: x |> function_name() is the same as function_name(x)
x <- 10
y <- sqrt(x)
y
y <- x |> sqrt()
y
# ==============================================================================
# PITFALL: Don't forget to remove the object from the function call
x |> sqrt(x) # wrong
x |> sqrt() # correct
# ==============================================================================
# USECASE: You can still use arguments when piping
z <- round(3.14, digits = 1)
z
z <- 3.14 |> round(digits = 1)
z
# ==============================================================================
# USECASE: Pipes are useful with tibbles and wrangling verbs
starwars
sw <- select(starwars, name, species, height)
sw
sw <- starwars |> select(name, species, height)
sw
# ==============================================================================
# PITFALL: Don't add a pipe without a step after it
sw <- starwars |> select(name, species, height) |> # error
Pipelines Live Coding
# USECASE: You can chain multiple pipes together to make a pipeline
x <- 10 |> sqrt() |> round()
x
# ==============================================================================
# TIP: If you want to see the output of a pipeline, you can pipe to print()
x <- 10 |> sqrt() |> round() |> print()
# ==============================================================================
# TIP: To make your pipelines more readable, move each step to a new line
x <- 10 |>
  sqrt() |>
  round() |>
  print()
# ==============================================================================
# PITFALL: Don't put the pipe at the beginning of a line, though
x <- 10
  |> sqrt()
  |> round()
  |> print() # error
# ==============================================================================
# USECASE: Chain together a series of verbs to flexibly wrangle data
tallones <- starwars |>
  select(name, species, height) |>
  rename(height_cm = height) |>
  mutate(height_ft = height_cm / 30.48) |>
  filter(height_ft > 7) |>
  arrange(desc(height_ft)) |>
  print()
Factors
Factors are used to represent categorical data
Factors have multiple possible levels
Levels are discrete and mutually-exclusive
Sometimes categories are unordered (nominal)
Action or Comedy or Drama
Asia or Europe or North America
Sometimes categories are ordered (ordinal)
Mild < Medium < Hot
XS < S < M < L < XL
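A short base R sketch of the nominal/ordinal distinction above, using made-up values. It shows what ordering buys you: ordered factors support comparisons and max(), while unordered ones do not.

```r
# Nominal: levels have no inherent order (hypothetical values)
genre <- factor(c("Drama", "Action", "Comedy", "Drama"))
levels(genre) # by default, levels are sorted alphabetically
is.ordered(genre) # FALSE

# Ordinal: list the levels low-to-high and set ordered = TRUE
size <- factor(
  c("M", "XS", "XL", "S", "M"),
  levels = c("XS", "S", "M", "L", "XL"),
  ordered = TRUE
)
is.ordered(size) # TRUE
size < "M" # comparisons respect the level order
max(size) # the largest observed level
```

Note that "L" is a valid level even though no observation uses it; levels describe what is possible, not only what was observed.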
Factors Live Coding
# USECASE: Ask 10 kids to order 1: nuggets, 2: pizza, or 3: salad
food <- c(2, 2, 1, 2, 1, 2, 1, 1, 2, 2)
food
# ==============================================================================
# LESSON: We can turn this vector into a factor with factor()
food2 <- factor(food, levels = c(1, 2, 3))
food2
food3 <- factor(food, levels = c(1, 2, 3),
                labels = c("nuggets", "pizza", "salad"))
food3
# ==============================================================================
# USECASE: We can also quickly and easily count each level with table()
table(food3)
# ==============================================================================
# PITFALL: Don't confuse levels and labels
food4 <- factor(food, labels = c(1, 2, 3),
                levels = c("nuggets", "pizza", "salad"))
food4 # full of <NA> because it can't find these levels
# ==============================================================================
# USECASE: You can also just enter strings directly (as self-labels)
genre <- c("pop", "metal", "pop", "rock", "rap", "rap", "pop", "rock")
genre
genre2 <- factor(genre) # observed levels will be assigned alphabetically
genre2
table(genre2)
# ==============================================================================
# LESSON: If ordinal, enter levels low-to-high and add ordered = TRUE
salsa <- c("hot", "mild", "medium", "mild", "medium", "medium")
salsa2 <- factor(salsa, levels = c("mild", "medium", "hot"), ordered = TRUE)
salsa2
# NOTE: We may want to visualize or model ordinal factors differently
# ==============================================================================
# USECASE: Working with factors in a tibble
cereal <- read_csv("cereal.csv")
cereal
cereal2 <- mutate(cereal, mfr = factor(mfr), type = factor(type))
cereal2
table(cereal2$mfr)
table(cereal2$type)
Missing Values
Sometimes your data will have missing values
Perhaps these were never collected
Perhaps the values were lost/corrupted
Perhaps the participant didn’t respond
We need to tell R which values are missing
To do so, we set those values to NA
Functions from tidyverse make this easy
Missingness is often “contagious” in R (e.g., a vector with NA has an unknown mean)
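A minimal base R sketch of what “contagious” means in practice, using a made-up vector. It also shows is.na(), a base R function for locating and counting missing values.

```r
x <- c(2, 4, NA, 8) # hypothetical values; the third one is missing

x + 1 # arithmetic with NA gives NA
x > 3 # comparisons with NA give NA
mean(x) # one NA makes the whole mean NA

is.na(x) # TRUE where a value is missing
sum(is.na(x)) # count the missing values
```

The logic is that an unknown value plus one is still unknown, so R refuses to guess.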
Missing Values Live Coding
# SETUP: We will need tidyverse for the read and mutate functions
library(tidyverse)
# ==============================================================================
# PITFALL: Number codes for missingness will mess up calculations in R
heights <- c(149, 158, -999) # here we use -999 to represent a missing value
range(heights)
mean(heights)
log(heights) # our missing value is no longer -999
# ==============================================================================
# USECASE: Use NA for missingness instead
heights2 <- c(149, 158, NA)
heights2
log(heights2) # the NA stayed an NA (due to contagiousness)
# ==============================================================================
# LESSON: Use na.rm = TRUE to do a summary function ignoring the NAs
mean(heights2) # the mean is an NA (due to contagiousness)
mean(heights2, na.rm = TRUE)
range(heights2, na.rm = TRUE)
# ==============================================================================
# USECASE: Dealing with missing values in tibbles
cereal <- read_csv("cereal.csv")
cereal$rating
range(cereal$rating)
# ==============================================================================
# LESSON: Use na_if() to convert specific values to NA while mutating
cereal2 <- mutate(cereal, rating = na_if(rating, -999))
cereal2$rating
range(cereal2$rating, na.rm = TRUE)
# ==============================================================================
# LESSON: Use read_csv(na) to convert specific values to NA while reading
cereal3 <- read_csv("cereal.csv", na = "-999")
cereal3$rating
range(cereal3$rating, na.rm = TRUE)